arXiv:1508.00429v1 [q-bio.NC] 3 Aug 2015


A three-threshold learning rule approaches the maximal capacity of 
recurrent neural networks 

Alireza Alemi 1,2,*, Carlo Baldassi 1,2, Nicolas Brunel 3, Riccardo Zecchina 1,2

1 Human Genetics Foundation (HuGeF), Turin, Italy 

2 DISAT, Politecnico di Torino, Turin, Italy 

3 Departments of Statistics and Neurobiology, University of Chicago, USA 
* alemi@polito.it 


Abstract 

Understanding the theoretical foundations of how memories are encoded and retrieved in neural 
populations is a central challenge in neuroscience. A popular theoretical scenario for modeling 
memory function is the attractor neural network scenario, whose prototype is the Hopfield model. 
The model simplicity and the locality of the synaptic update rules come at the cost of a poor 
storage capacity, compared with the capacity achieved with perceptron learning algorithms. Here, 
by transforming the perceptron learning rule, we present an online learning rule for a recurrent 
neural network that achieves near-maximal storage capacity without an explicit supervisory error 
signal, relying only upon locally accessible information. The fully-connected network consists of 
excitatory binary neurons with plastic recurrent connections and non-plastic inhibitory feedback 
stabilizing the network dynamics; the memory patterns to be memorized are presented online 
as strong afferent currents, producing a bimodal distribution for the neuron synaptic inputs. 
Synapses corresponding to active inputs are modified as a function of the value of the local fields 
with respect to three thresholds. Above the highest threshold, and below the lowest threshold, 
no plasticity occurs. In between these two thresholds, potentiation/depression occurs when the 
local field is above/below an intermediate threshold. We simulated and analyzed a network of 
binary neurons implementing this rule and measured its storage capacity for different sizes of the 
basins of attraction. The storage capacity obtained through numerical simulations is shown to 
be close to the value predicted by analytical calculations. We also measured the dependence of 
capacity on the strength of external inputs. Finally, we quantified the statistics of the resulting 
synaptic connectivity matrix, and found that both the fraction of zero weight synapses and the 
degree of symmetry of the weight matrix increase with the number of stored patterns. 


Author Summary 

Recurrent neural networks have been shown to be able to store memory patterns as fixed point 
attractors of the dynamics of the network. The prototypical learning rule for storing memories 
in attractor neural networks is Hebbian learning, which can store up to 0.138N uncorrelated
patterns in a recurrent network of N neurons. This is very far from the maximal capacity 2N, 
which can be achieved by supervised rules, e.g. by the perceptron learning rule. However, these 
rules are problematic for neurons in the neocortex or the hippocampus, since they rely on the 
computation of a supervisory error signal for each neuron of the network. We show here that the
total synaptic input received by a neuron during the presentation of a sufficiently strong stimulus 
contains implicit information about the error, which can be extracted by setting three thresholds 
on the total input, defining depression and potentiation regions. The resulting learning rule 
implements basic biological constraints, and our simulations show that a network implementing 
it gets very close to the maximal capacity, both in the dense and sparse regimes, across all values 
of storage robustness. The rule predicts that when the total synaptic input goes beyond a
threshold, no potentiation should occur. 


Introduction 

One of the fundamental challenges in neuroscience is to understand how we store and retrieve 
memories for a long period of time. Such long-term memory is fundamental for a variety of 
our cognitive functions. A popular theoretical framework for storing and retrieving memories in 
recurrent neural networks is the attractor network framework. Attractors, i.e. stable
states of the dynamics of a recurrent network, are set by modification of synaptic efficacies in 
a recurrent network. Synaptic plasticity rules specify how the efficacy of a synapse is affected 
by pre- and post-synaptic neural activity. In particular, Hebbian synaptic plasticity rules lead 
to long-term potentiation (LTP) for correlated pre- and post-synaptic activities, and long-term 
depression (LTD) for anticorrelated activities. These learning rules build excitatory feedback 
loops in the synaptic connectivity, resulting in the emergence of attractors that are correlated 
with the patterns of activity that were imposed on the network through external inputs. Once 
a set of patterns become attractors of a network (in other words when the network “learns” 
the patterns), upon a brief initial activation of a subpopulation of neurons, the network state 
evolves towards the learned stable state (the network “retrieves” a past stored memory), and 
remains in that state after removal of the external inputs (and hence maintains the information 
in short-term memory). The set of initial network states leading to a memorized state is called 
the basin of attraction , whose size determines how robust a memory is. The attractor neural 
network scenario was originally explored in networks of binary neurons, and then extended
from the 1990s onward to networks of spiking neurons.

Experimental evidence in different areas of the brain, including inferotemporal cortex
and prefrontal cortex, has provided support for the attractor neural network framework,
using electrophysiological recordings in awake monkeys performing delayed response tasks. In
such experiments, the monkey has to maintain information in short-term (working) memory in 
a ‘delay period’ to be able to perform the task. Consistent with the attractor network scenario, 
some neurons exhibit selective persistent activity during the delay period. This persistent activity 
of ensembles of cortical neurons has thus been hypothesized to form the basis of the working 
memory of stimuli shown in these tasks. 

One of the most studied properties of attractor neural networks as models of memory is
their storage capacity, i.e. how many random patterns can be learned in a recurrent network of
N neurons in the large N limit. Storage capacity depends both on the network architecture
and on the synaptic learning rule. In many models, the storage capacity scales with N. In
particular, the Hopfield network, which uses a Hebbian learning rule, has a storage capacity
of 0.138N in the limit $N \to \infty$ [15]. Later studies showed how the capacity depends on
the connection probability in a randomly connected network and on the coding level
(fraction of active neurons in a pattern) [18, 19]. A natural question is, what is the maximal
capacity of a given network architecture, over all possible learning rules? This question was 
answered by Elizabeth Gardner, who showed that the capacity of fully connected networks of 
binary neurons with dense patterns scales as 2N [20], a storage capacity which is much larger
than that of the Hopfield model. The next question is which learning rules are able to saturate
the Gardner bound. A simple learning rule that is guaranteed to achieve this bound is the
perceptron learning rule (PLR) [21] applied to each neuron independently. However, unlike the
rule used in the Hopfield model, the perceptron learning rule is a supervised rule that needs an
explicit “error signal” in order to achieve the Gardner bound. While such an error signal might
be available in the cerebellum, it is unclear how error signals targeting individual neurons
might be implemented in cortical excitatory synapses. Therefore, it remains unclear whether and
how networks with realistic learning rules might approach the Gardner bound. 

The goal of the present paper is to propose a learning rule whose capacity approaches the 
maximal capacity of recurrent neural networks by transforming the original perceptron learning 
rule such that the new rule does not explicitly use an error signal. The perceptron learning 
rule modifies the synaptic weights by comparing the desired output with the actual output to 
obtain an error signal, subsequently changing the weights in the opposite direction of the error 
signal. We argue that the total synaptic inputs (‘local fields’) received by a neuron during the 
presentation of a stimulus contain some information about the current error (i.e. whether the 
neuron will end up in the right state after the stimulus is removed). We use this insight to build 
a field dependent learning rule that contains three thresholds separating no plasticity, LTP and 
LTD regions. This rule implements basic biological constraints: (a) it uses only information 
local to the synapse; (b) the new patterns can be learned incrementally, i.e. it is an online rule; 
(c) it does not need an explicit error signal; (d) synapses obey Dale’s principle, i.e. excitatory 
synapses are not allowed to have negative weights. We studied the capacity and the size of 
the basins of attraction for a binary recurrent neural network in which excitatory synapses are 
endowed with this rule, while a global inhibition term controls the global activity level. We 
investigated how the strength of external fields and the presence of correlations in the inputs 
affect the memory capacity. Finally, we investigated the statistical properties of the connectivity 
matrix (distribution of synaptic weights, degree of symmetry). 


Results 

The network 

We simulated a network of $N$ binary (McCulloch-Pitts) neurons, fully connected with excitatory
synapses (Fig 1A). All the neurons feed a population of inhibitory neurons, which is modeled as
a single aggregated inhibitory unit. This state-dependent global inhibition projects back onto
all the neurons, stabilizing the network and controlling its activity level. At each time step, the
activity (or state) of neuron $i$ ($i = 1, \dots, N$) is described by a binary variable $s_i \in \{0, 1\}$. The
state is a step function of the local field $v_i$ of the neuron:

$$ s_i = \Theta(v_i - \theta), \qquad (1) $$

where $\Theta$ is the Heaviside function ($\Theta(x) = 1$ if $x > 0$ and $0$ otherwise) and $\theta$ is a neuronal
threshold. The local field $v_i$ represents the overall input received by the neuron from its excitatory
and inhibitory connections (Fig 1B). The excitatory connections are of two kinds: recurrent
connections from within the excitatory population, and external inputs.

The recurrent excitatory connections are mediated by synaptic weights, denoted by a matrix
$W$ whose elements $w_{ij}$ (the weight of the synapse from neuron $j$ to $i$) are continuous non-negative
variables ($w_{ij} \in [0, \infty)$; $w_{ii} = 0$). In the following, and in all our simulations, we assume that the
weights are initialized randomly before the training takes place (see Materials and Methods).
Figure 1: A sketch of the network and the neuron model. A. Structure of the network.
The fully connected network consists of $N$ binary ($s_i \in \{0, 1\}$) neurons and an aggregated
inhibitory unit. The global inhibition is a function of the state of the network and the external
fields, i.e. $I(x, s)$. A memory pattern $\xi$ is encoded as strong external fields, i.e. $x = X\xi$, and
presented to the network during the learning phase. B. Each neuron receives excitatory recurrent
inputs (thin black arrows) from the other neurons, a global inhibitory input (red connections),
and a strong binary external field ($x_i \in \{0, X\}$; thick black arrows). All these inputs are summed
to obtain the total field, which is then compared to a neuronal threshold $\theta$; the output of the
neuron is a step function of the result.










Therefore, in the absence of external inputs, the local field of each neuron $i$ is given by:

$$ v_i = \sum_{j=1}^{N} w_{ij} s_j - I_0(s), \qquad (2) $$

where $I_0(s)$ represents the inhibitory input.

For the sake of simplicity, we simulated a synchronous update process, in which the activity
of each neuron $s_i$ is computed from the local field $v_i$ at the previous time step, and all updates
happen in parallel.

The network was designed so that, in the absence of external input and prior to the training
process, it should spontaneously stabilize itself to some fixed overall average activity level $f$
(fraction of active neurons, or sparseness), regardless of the initial conditions. In particular, we
aimed at avoiding trivial attractors (the all-off and all-on states). To this end, we model the
inhibitory feedback (in the absence of external inputs) as a linear function of the overall excitatory
activity:

$$ I_0(s) = H_0 + \lambda \left( \sum_{i=1}^{N} s_i - fN \right). \qquad (3) $$

The parameters $H_0$ and $\lambda$ can be understood as follows: $H_0$ is the average inhibitory activity when
the excitatory network has the desired activity level $f$, i.e. when $\sum_{i=1}^{N} s_i = fN$; $\lambda$ measures
the strength of the inhibitory feedback onto the excitatory network. This expression can be
interpreted as a first-order approximation of the inhibitory activity as a function of the excitatory
activity around some reference value $fN$, which is reasonable under the assumption that the
deviations from $fN$ are small enough. Indeed, by properly setting these two parameters in
relation to the other network parameters (such as $\theta$ and the average connection strength) it is
possible to achieve the desired goal of a self-stabilizing network.
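As an illustration of the dynamics without external input (Eqs. 1 to 3), the following is a minimal sketch in plain Python. The function names and the toy parameter values are assumptions made for this example, not the settings used in the simulations.

```python
def inhibition(s, H0, lam, f):
    """Eq. 3: I_0(s) = H0 + lam * (sum_i s_i - f * N)."""
    return H0 + lam * (sum(s) - f * len(s))


def local_fields(w, s, H0, lam, f):
    """Eq. 2: v_i = sum_j w_ij * s_j - I_0(s), for every neuron i."""
    inh = inhibition(s, H0, lam, f)
    return [sum(w_ij * s_j for w_ij, s_j in zip(row, s)) - inh for row in w]


def synchronous_step(w, s, theta, H0, lam, f):
    """Eq. 1 applied to all neurons in parallel: s_i = 1 if v_i > theta, else 0."""
    return [1 if v_i > theta else 0 for v_i in local_fields(w, s, H0, lam, f)]


# Toy 4-neuron network; all numbers here are hypothetical.
w = [[0.0, 1.0, 1.0, 1.0],
     [1.0, 0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0, 0.0]]
s = [1, 1, 1, 1]  # all-on state: activity is above the target level f * N = 2
s_next = synchronous_step(w, s, theta=0.5, H0=1.0, lam=2.0, f=0.5)
```

In this toy case the inhibitory term exceeds the recurrent excitation, so the all-on state is not a fixed point. Whether the network then settles at activity level $f$ depends on how $H_0$ and $\lambda$ are tuned against $\theta$ and the weights, which the sketch does not attempt.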

In the training process, the network is presented with a set of $p$ patterns in the form of strong
external inputs, representing the memories which need to be stored. We denote the patterns
as $\{\xi^\mu\}$ (where $\mu = 1, \dots, p$ and $\xi_i^\mu \in \{0, 1\}$), and assume that each entry is drawn randomly
and independently. For simplicity, the coding level $f$ for the patterns was set equal to the
spontaneous activity level of the network, i.e. $\xi_i^\mu = 1$ with probability $f$, $0$ otherwise. During
the presentation of a pattern $\mu$, each neuron $i$ receives an external binary input $x_i = X \xi_i^\mu$,
where $X$ denotes the strength of the external inputs, which we parameterized as $X = \gamma \sqrt{N}$. In
addition, the external input also affects the inhibitory part of the network, eliciting a response
which indirectly downregulates the excitatory neurons. We model this effect as an additional
term $H_1$ in the expression for the inhibitory term (Eq. 3), which therefore becomes:


$$ I(x, s) = H_0 + H_1 + \lambda \left( \sum_{i=1}^{N} s_i - fN \right). \qquad (4) $$


The general expression for the local field $v_i$ then reads:

$$ v_i = \sum_{j=1}^{N} w_{ij} s_j + x_i - I(x, s). \qquad (5) $$

In the absence of external fields, $x_i = 0$ for all $i$, and thus Eqs. 4 and 5 reduce to the previous
expressions, Eqs. 3 and 2.
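To make the effect of a stimulus on the local fields concrete, here is a minimal sketch that compares the fields with and without external input for the same network state. It assumes that the stimulus-evoked inhibition enters as a single additive term $H_1$, set to $H_1 = fX$ as chosen later in the text; the function names and the toy numbers are hypothetical.

```python
import math


def free_fields(w, s, H0, lam, f):
    """Local fields without external input (Eqs. 2 and 3)."""
    N = len(s)
    inh = H0 + lam * (sum(s) - f * N)
    return [sum(w[i][j] * s[j] for j in range(N)) - inh for i in range(N)]


def presentation_fields(w, s, xi, gamma, H0, lam, f):
    """Local fields (Eq. 5) while pattern xi is shown with strength X = gamma * sqrt(N)."""
    N = len(s)
    X = gamma * math.sqrt(N)
    H1 = f * X                              # assumption: H_1 = f * X, as in the text
    inh = H0 + H1 + lam * (sum(s) - f * N)  # assumed additive form of the inhibitory term
    return [sum(w[i][j] * s[j] for j in range(N)) + X * xi[i] - inh
            for i in range(N)]


# Toy example: N = 4, f = 0.5, gamma = 1, so X = 2 and H_1 = 1.
xi = [1, 1, 0, 0]
w = [[0.0, 0.5, 0.5, 0.5],
     [0.5, 0.0, 0.5, 0.5],
     [0.5, 0.5, 0.0, 0.5],
     [0.5, 0.5, 0.5, 0.0]]
v_free = free_fields(w, xi, H0=0.25, lam=1.0, f=0.5)
v_pres = presentation_fields(w, xi, xi, gamma=1.0, H0=0.25, lam=1.0, f=0.5)
shift = [a - b for a, b in zip(v_pres, v_free)]
```

Under these assumptions, `shift` is $(1 - f)X$ for the neurons that receive the input and $-fX$ for those that do not, which is what separates the two populations during presentation.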

The goal of the learning process is to find values of the $w_{ij}$'s such that the patterns $\{\xi^\mu\}$ become
attractors of the network dynamics. Qualitatively, this means that, if the training process is
successful, then whenever the network state gets sufficiently close to one of the stored patterns,
i.e. whenever the Hamming distance $d = \sum_{i=1}^{N} |\xi_i^\mu - s_i|$ between the current network state and
a pattern $\mu$ is sufficiently small, the network dynamics in the absence of external inputs should
drive the network state towards a fixed point equal to the pattern itself (or very close to it).
The general underlying idea is that, after a pattern is successfully learned, some brief external
input which initializes the network close to the learned state would be sufficient for the network
to recognize and retrieve the pattern. The maximum value of $d$ for which this property holds is
then called the basin of attraction size (or just basin size hereafter for simplicity); indeed, there
is generally a trade-off between the number of patterns which can be stored according to this
criterion and the size of their basin of attraction.

More precisely, the requirement that a pattern is a fixed point of the network dynamics in
the absence of external fields can be reduced to a condition for each neuron $i$ (cf. Eqs. 1 and 5):

$$ \forall i: \quad \Theta\left( \sum_{j=1}^{N} w_{ij} \xi_j^\mu - I_0(\xi^\mu) - \theta \right) = \xi_i^\mu. \qquad (6) $$

This condition only guarantees that, if the network is initialized into a state $s = \xi^\mu$, then it will
not spontaneously change its state, i.e. it implements a zero-size basin of attraction. A simple
way to enlarge the basin size is to make the requirement in Eq. 6 more stringent, by enforcing a
stricter constraint on the local fields:

$$ \sum_{j=1}^{N} w_{ij} \xi_j^\mu - I_0(\xi^\mu) \; \begin{cases} > \theta + f\sqrt{N}\,\epsilon & \text{if } \xi_i^\mu = 1, \\ < \theta - f\sqrt{N}\,\epsilon & \text{if } \xi_i^\mu = 0, \end{cases} \qquad (7) $$

where $\epsilon \geq 0$ is a robustness parameter. When $\epsilon = 0$, we recover the previous zero-basin-size
scenario; by increasing $\epsilon$ we make the neurons’ response more robust towards noise in their inputs,
and thus we enlarge the basin of attraction of the stored patterns (but then fewer patterns can
be stored, as noted above).

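The robust fixed-point condition (Eq. 7) can be checked neuron by neuron. The sketch below is one way to do so in plain Python, with the inhibition of Eq. 3 evaluated at the pattern itself; the function name `satisfies_margin` and the toy network are assumptions made for illustration.

```python
import math


def satisfies_margin(w, xi, theta, H0, lam, f, eps):
    """Return True if pattern xi meets Eq. 7 for every neuron at robustness eps."""
    N = len(xi)
    inh = H0 + lam * (sum(xi) - f * N)   # Eq. 3 with s = xi
    margin = f * math.sqrt(N) * eps
    for i in range(N):
        h = sum(w[i][j] * xi[j] for j in range(N)) - inh
        if xi[i] == 1 and not h > theta + margin:
            return False                 # ON neuron is not far enough above theta
        if xi[i] == 0 and not h < theta - margin:
            return False                 # OFF neuron is not far enough below theta
    return True


# Toy example (hypothetical numbers): the two ON neurons excite each other.
xi = [1, 1, 0, 0]
w = [[0.0, 3.0, 0.0, 0.0],
     [3.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
stored_small_margin = satisfies_margin(w, xi, theta=1.0, H0=0.0, lam=0.0, f=0.5, eps=0.5)
stored_large_margin = satisfies_margin(w, xi, theta=1.0, H0=0.0, lam=0.0, f=0.5, eps=2.5)
```

With these numbers the pattern passes at the smaller robustness but fails at the larger one, which illustrates the trade-off: a larger $\epsilon$ demands a wider gap between the fields of ON and OFF neurons.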


The three-threshold learning rule (3TLR) 

In the training phase, the network is presented with patterns as strong external fields $x_i$. Patterns
are presented sequentially in random order. For each pattern $\mu$, we simulated the following
scheme:

Step 1: The pattern is presented (i.e. the external inputs are set to $x_i = X \xi_i^\mu$). A single step
of synchronous updating is performed (Eqs. 1, 4 and 5). If the external inputs are strong enough,
i.e. $\gamma$ is large enough, this updating sets the network in a state corresponding to the presented
pattern.

Step 2: Learning occurs. Each neuron $i$ may update its synaptic weights depending on 1)
their current value $w_{ij}^t$, 2) the state of the pre-synaptic neurons, and 3) the value of the local
field $v_i$. Therefore, all the information required is locally accessible, and no explicit error signals
are used. The new synaptic weights $w_{ij}^{t+1}$ are set to:


$$ w_{ij}^{t+1} = \begin{cases} w_{ij}^t - \eta s_j & \text{if } \theta_0 < v_i < \theta, \\ w_{ij}^t + \eta s_j & \text{if } \theta < v_i < \theta_1, \\ w_{ij}^t & \text{otherwise,} \end{cases} \qquad (8) $$

where $\eta$ is the learning rate, and $\theta_0$ and $\theta_1$ are two auxiliary learning thresholds set as

$$ \theta_0 = \theta - (\gamma + \epsilon) f \sqrt{N}, \qquad (9) $$

$$ \theta_1 = \theta + (\gamma + \epsilon) f \sqrt{N}. \qquad (10) $$

We refer to this update scheme as the “three-threshold learning rule” (3TLR). After some number
of presentations, we checked whether the patterns were learned by presenting a noisy version of
these patterns, and checking whether the patterns (or network states which are very close to the
patterns) are fixed points of the network dynamics.
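The weight update of Step 2 can be sketched as follows, assuming the local fields `v` were measured during the presentation of the pattern and `s` is the network state at that time. Two details are assumptions of this sketch rather than statements from the text: weights are kept non-negative by clipping at zero, and the diagonal is left untouched so that $w_{ii} = 0$.

```python
import math


def three_threshold_update(w, s, v, theta, gamma, eps, f, eta):
    """Apply one 3TLR step (Eq. 8) in place and return the weight matrix w."""
    N = len(s)
    scale = (gamma + eps) * f * math.sqrt(N)
    theta0 = theta - scale              # Eq. 9
    theta1 = theta + scale              # Eq. 10
    for i in range(N):
        if theta0 < v[i] < theta:
            delta = -eta                # depression region
        elif theta < v[i] < theta1:
            delta = eta                 # potentiation region
        else:
            continue                    # outside (theta0, theta1): no plasticity
        for j in range(N):
            if j != i and s[j] == 1:    # only synapses with active inputs change
                # Assumption: clip at zero to keep excitatory weights non-negative.
                w[i][j] = max(0.0, w[i][j] + delta)
    return w


# Toy example (hypothetical numbers): theta0 = -1 and theta1 = 1.
w = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
s = [0, 0, 1, 1]
v = [-2.0, -0.5, 0.5, 2.0]
three_threshold_update(w, s, v, theta=0.0, gamma=1.0, eps=0.0, f=0.5, eta=0.25)
```

In the toy example, neuron 1 falls in the depression region and neuron 2 in the potentiation region, while neurons 0 and 3 lie outside the outer thresholds and keep their weights.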

When $N \gg 1$, $\gamma$ is large enough, and $H_1 = fX$, the update rule described by Eq. 8 is
essentially equivalent to the perceptron learning rule for the task described in Eq. 7. This can
be shown as follows (see also Fig 2 for a graphical representation of the case $f = 0.5$ and $\epsilon = 0$):
when a stimulus is presented, the population of neurons is divided into two groups, one for which
$x_i = 0$ and one for which $x_i = X$. The net effect of the stimulus presentation on the local
field has to take into account the indirect effect through the inhibitory part of the network (see
Eq. 4), and thus is equal to $-fX$ for the $x_i = 0$ population and to $(1 - f)X$ for the $x_i = X$
population. Before learning, the distribution of the local fields across the excitatory population,
in the limit $N \to \infty$, is a Gaussian whose standard deviation is proportional to $\sqrt{N}$, due to the
central limit theorem; moreover, the parameter $H_0$ is set so that the average activity level of the
network is $f$, which means that the center of the Gaussian will be within a distance of order $\sqrt{N}$
from the neuronal threshold $\theta$ (this also applies if we use different values for the spontaneous
activity level and the pattern activity level). Therefore, if $X = \gamma\sqrt{N}$ is large enough, the state
of the network during stimulus presentation will be effectively clamped to the desired output,
i.e. $s_i = \xi_i^\mu$ for all $i$. This fact has two consequences: 1) the local field potential can be used to
detect the desired output by just comparing it to the threshold, and 2) each neuron $i$ will receive,
as its recurrent inputs $\{s_j\}_{j \neq i}$, the rest of the pattern $\{\xi_j^\mu\}_{j \neq i}$. Furthermore, due to the choice
of the secondary thresholds $\theta_0$ and $\theta_1$ in Eqs. 9 and 10, the difference between the local field
and $\theta_0$ (or $\theta_1$) during stimulus presentation for the $x_i = 0$ population (or $x_i = X$, respectively)
is equal to the difference between the local field and $\theta - f\sqrt{N}\epsilon$ (or $\theta + f\sqrt{N}\epsilon$, respectively) in
the absence of external stimuli, provided the recurrent inputs are the same. Therefore, the value
of the local field $v_i$ during stimulus presentation in relation to the three thresholds $\theta$, $\theta_0$ and
$\theta_1$ is sufficient to determine whether an error is made with respect to the constraints of Eq. 7
and which kind of error is made. Following these observations, it is straightforward to map the
standard perceptron learning rule on the 4 different cases which may occur (see Fig 2), resulting
in Eq. 8.
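The threshold shift described above can be written out explicitly. The following worked step is a sketch for the case $f = 1/2$ illustrated in Fig 2, assuming the clamped state $s = \xi^\mu$, $H_1 = fX$ and $X = \gamma\sqrt{N}$; the symbol $h_i$ is introduced here only as shorthand for the field in the absence of external input.

```latex
% h_i: local field without external input, evaluated at s = \xi^\mu (shorthand used only here)
\begin{align*}
h_i &= \sum_{j=1}^{N} w_{ij}\,\xi_j^\mu - I_0(\xi^\mu), \\[4pt]
\text{OFF neuron } (x_i = 0):\quad
v_i - \theta_0
  &= \bigl(h_i - fX\bigr) - \bigl(\theta - (\gamma + \epsilon) f \sqrt{N}\bigr)
   = h_i - \bigl(\theta - f\sqrt{N}\,\epsilon\bigr), \\[4pt]
\text{ON neuron } (x_i = X):\quad
v_i - \theta_1
  &= \bigl(h_i + (1 - f)X\bigr) - \bigl(\theta + (\gamma + \epsilon) f \sqrt{N}\bigr)
   = h_i - \bigl(\theta + f\sqrt{N}\,\epsilon\bigr)
   \qquad \bigl(f = \tfrac{1}{2}\bigr).
\end{align*}
```

Read this way, an OFF neuron with $v_i > \theta_0$ during presentation is one whose field without the stimulus would violate the lower constraint of the robust fixed-point condition, so its active synapses are depressed. An ON neuron with $v_i < \theta_1$ violates the upper constraint, so its active synapses are potentiated.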

In Fig 3 we demonstrate the effect of the learning rule on the distribution of the local field
potentials as measured from a simulation (with $f = 0.5$ and $\epsilon = 1.2$): the initial distribution
of the local fields of the neurons, before the learning process takes place and in the absence of
external fields, is well described by a Gaussian distribution centered on the neuronal threshold
$\theta$ (see Fig 3A) with a standard deviation which scales as $\sqrt{N}$. During a pattern presentation,
the resulting distribution becomes a bimodal one; before learning takes place, the distribution is
given by the sum of two Gaussians of equal width, centered around $\theta_0 + f\sqrt{N}\epsilon$ and $\theta_1 - f\sqrt{N}\epsilon$
(Fig 3B). The left Gaussian corresponds to the cases where $x_i = 0$ and the right one to the cases
where $x_i = X$. Having applied the learning rule, we observe that the depression region (i.e. the
interval $(\theta_0, \theta)$) and the potentiation region (i.e. $(\theta, \theta_1)$) get depleted (Fig 3C). In the testing
phase, when the external inputs are absent, the left and right parts of the distribution come
closer, such that the distance between the two peaks is equal to at least $2 \epsilon f \sqrt{N}$ (Fig 3D). This
margin between the local fields of the ON and OFF neurons makes the attractors more robust.
Figure 2: The three-threshold learning rule (3TLR), and its relationship with the standard
perceptron learning rule (PLR). Left panel: original perceptron rule. Right panel: 3TLR.
The perceptron learning rule modifies the synaptic
weights by comparing the desired output with the actual output to obtain an error signal,
subsequently changing the weights in the opposite direction of the error signal (see the table in the left
panel). For a pattern which is uncorrelated with the current synaptic weights, the distribution
of the local fields is Gaussian (in the limit of large $N$), due to the central limit theorem. $H_0$ is set such that, on
average, a fraction $f$ of the local fields are above the neuronal threshold $\theta$; in the case of $f = 0.5$,
this means that the Gaussian is centered on $\theta$ (left panel). In our model (Fig 1B), the desired
output is given as a strong external input, whose distribution across the population is bimodal
(with two delta functions on $x_i = 0$ and $x_i = X$); therefore, the distribution of the local fields
during stimulus presentation becomes bimodal as well (right panel). The left and right bumps
of this distribution correspond to cases where the desired outputs are zero and one, respectively.
Note that, since the external input also elicits an inhibitory response, the neurons in the
network which are not directly affected by the external input (i.e. those with desired output equal
to zero) are effectively hyperpolarized. If $X$ is sufficiently large, the two distributions do not
overlap, and the four cases of the PLR can be mapped to the four regions determined from the
three thresholds, indicated by vertical dashed lines (see text).






















Figure 3: Distribution of local fields before and after learning for $f = 0.5$ and non-zero
robustness. A. Before learning begins, the distribution of the local fields of the neurons is a Gaussian
(due to the central limit theorem) centered around the neuronal threshold $\theta$, both for neurons
with desired output zero (OFF neurons) and with desired output one (ON neurons). The
goal is to have the local field distribution of the ON neurons (red curve) above the threshold $\theta$,
and that of the OFF neurons below $\theta$. B. Once any of the to-be-stored patterns is presented
as strong external fields, right before the learning process starts, the local field distribution of the
OFF neurons shifts to the left, centered around $\theta_0 + f\epsilon\sqrt{N}$, whereas the distribution of the
ON neurons moves to the right, centered around $\theta_1 - f\epsilon\sqrt{N}$, with a negligible overlap
between the two curves if the external field is strong enough. Thanks to the strong external
fields and global inhibition, the local fields of the ON and OFF neurons are well separated. C.
Due to the learning process, the local fields within the depression region [i.e. $(\theta_0, \theta)$] get pushed
to the left, below $\theta_0$, whereas those within the potentiation region get pushed further to
the right, above $\theta_1$. If the learning process is successful, it will result in a region $(\theta_0, \theta_1)$
which no longer contains local fields, with two sharp peaks on $\theta_0$ and $\theta_1$. D. After successful
learning, once the external fields are removed, the blue and red curves come closer, with a gap
equal to $2 f\epsilon\sqrt{N}$. The larger the robustness parameter $\epsilon$, the larger the gap between the left
and right sides of the distribution. Notice that now the red curve is fully above $\theta$, which means those
neurons remain stably ON, while the blue curve is fully below $\theta$, which means those neurons
are stably OFF. Therefore the corresponding pattern is successfully stored by the network.

Storage capacity 

Since our proposed learning rule is able to mimic (or approximate, depending on the parameters)
the perceptron learning rule, which is known to be able to solve the task posed by Eq. 7 whenever
a solution exists, we expect that a network implementing such a rule can get close to maximal
capacity in terms of the number of memories which it can store at a given robustness level.
The storage capacity, denoted by $\alpha = p/N$, is measured as the ratio of the maximum number of
patterns $p$ which can successfully be stored to the number of neurons $N$, in the limit of large $N$.
As mentioned above, it is a function of the basin size.

We used the following definition for the basin size: a set of $p$ patterns is said to be successfully
stored at a size $b$ if, for each pattern, the retrieval rate when starting from a state in which a
fraction $b$ of the pattern was randomized is at least 90%. The retrieval rate is measured by the
probability that the network dynamics is able to bring the network state to an attractor within
1% distance from the pattern, in at most 30 steps. The distance between the state of the network
and a pattern $\mu$ is measured by the normalized Hamming distance $\frac{1}{N} \sum_{i=1}^{N} |s_i - \xi_i^\mu|$. Therefore,
at coding level $f = 0.5$, reaching a basin size $b$ means that the network can successfully recover
patterns starting from a state at distance $b/2$.
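The retrieval test can be sketched as below. This is an illustrative reading of the criterion, not the simulation code: `step` is a hypothetical callable that performs one synchronous update without external input, "attractor" is interpreted as a fixed point, and randomizing an entry is taken to mean redrawing it with probability $f$ of being active.

```python
import random


def normalized_hamming(s, xi):
    """(1/N) * sum_i |s_i - xi_i| for two binary lists of equal length."""
    return sum(abs(a - b) for a, b in zip(s, xi)) / len(xi)


def corrupt(xi, b, f, rng):
    """Return a copy of xi in which a fraction b of the entries is redrawn at random."""
    s = list(xi)
    for i in rng.sample(range(len(xi)), round(b * len(xi))):
        s[i] = 1 if rng.random() < f else 0
    return s


def retrieves(step, xi, s0, max_steps=30, tol=0.01):
    """True if the dynamics reaches a fixed point within tol of xi in max_steps updates."""
    s = list(s0)
    for _ in range(max_steps):
        nxt = step(s)
        if nxt == s:                      # fixed point reached
            return normalized_hamming(s, xi) <= tol
        s = nxt
    return False                          # no fixed point found within the step budget


# Usage with a stand-in dynamics that jumps straight to the pattern (hypothetical).
rng = random.Random(0)
xi = [1, 0, 1, 0, 1, 0, 1, 0]
start = corrupt(xi, b=0.5, f=0.5, rng=rng)
ok = retrieves(lambda state: list(xi), xi, start)
```

Because each redrawn entry at $f = 0.5$ agrees with the pattern half of the time, the expected starting distance produced by `corrupt` is $b/2$, in line with the statement above. The retrieval rate would be estimated by repeating the test over many corrupted starting states.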

Fig 4A shows the maximal capacity as a function of the basin size for a simulated network
of $N = 1001$ neurons. We simulated many pairs of $(\alpha, \epsilon)$ with different random seeds, obtaining
a probability of success for each pair. The red line shows the points for which the probability of
successful storage is 0.5, and the error bars span 0.95 to 0.05 success probability. The capacity
was optimized over the robustness parameter $\epsilon$. The maximal capacity (the Gardner bound) in
the limit $N \to \infty$ at zero basin size is $\alpha_c = 2$ for our model (see Materials and Methods for
the calculation), as for a network with unconstrained synaptic weights [20]. In Fig 4A, we also
compare our network with the Hopfield model. Our network stores close to the maximal capacity
at zero basin size, at least eleven times more than the Hopfield model. Across the range of basin
sizes, 3TLR achieves more than twice the capacity that can be achieved with the Hopfield model.

The enlargement of the basin of attraction was achieved by increasing the robustness
parameter $\epsilon$. We computed the maximal theoretical capacity as a function of $\epsilon$ at $N \to \infty$ (see
Materials and Methods) and compared it to our simulations, and to the maximal theoretical
capacity of the Hopfield network. The results are shown in Fig 4B. For any given value of $\epsilon$,
the cyan curve shows the maximum $\alpha$ for which the success ratio with our network was at least
0.5 across different runs. The difference between the theory and the experiments in our model
can be ascribed to several factors: the finite size of the network; the choice of a finite learning
rate $\eta$; and the fact that we imposed a hard limit on the number of pattern presentations (see
number of iterations in Table 1), while the perceptron rule for excitatory synaptic connectivity
is only guaranteed to be optimal in the limit of $\eta \to 0$, with a number of presentations inversely
proportional to $\eta$ [25]. Note that the correspondence between the PLR and the 3TLR is only
perfect in the large $\gamma$ limit, and is only approximate otherwise, as can be shown by comparing
explicitly the synaptic matrices obtained by both algorithms on the same set of patterns (see
Materials and Methods).

A crucial ingredient of the 3TLR is having a strong external input which effectively acts as a 
supervisory signal. How strong do the external fields need to be? How much does the capacity 
depend on this strength? To answer these questions, we measured the maximum number of 
stored patterns as a function of the parameter 7 which determines the strength of external fields 
as X = 1 VN. This parameter, in fact, determines how far the two Gaussian distributions of the 
local field are; as shown in Fig[2] the distance between the two peaks of the distribution is X. For 
large enough 7, the overlap of these two distributions is negligible and the capacity is maximal; 
but as we lower 7, the overlap increases, causing the learning rule to make mistakes, i.e. when 
it should potentiate, it depresses the synapses and vice versa. In our simulations with N = 1001 
neurons in the dense regime / = 0.5 at a fixed epsilon e = 0.3, we varied 7 and computed the 
maximum a that can be achieved with a fixed number of iterations (1000). The capacity indeed 
gradually decreases as 7 decreases, until it reaches a threshold, below which there is a sharp 

Figure 4: Critical capacity as a function of the basin size and the robustness parameter.
A. The red plot shows the critical capacity as a function of the size of the basins of
attraction ($N = 1001$ neurons in the dense regime $f = 0.5$) when the strength of the external
field is large ($\gamma = 6$), such that the ON and OFF neuronal populations are well separated. The
points indicate 0.5 probability of successful storage at a given basin size, optimized over the
robustness parameter $\epsilon$. The error bars show the [0.95, 0.05] probability interval for
successful storage. The blue plot shows the performance of the Hopfield model with $N = 1001$
neurons. The maximal capacity at zero basin size (the Gardner bound) is equal to 2. B. To compare
the simulation results for our model with the analytical results, we plotted the critical capacity
as a function of the robustness parameter $\epsilon$. The dark red curve is the critical capacity
versus $\epsilon$ for our model obtained from analytical calculations (see Materials and Methods),
the cyan line shows the result of simulations of our model, and the dark blue curve shows the
Gardner bound for a network with no constraints on synaptic weights. The difference between the
two theoretical curves is due to the constraints on the weights in our network.


drop in capacity (see Fig 5). With the above parameter values, this transition occurs at
$\gamma \approx 2.4$.

Figure 5: Dependence of the critical capacity on the strength of the external input.

We varied the strength of the external field ($\gamma$) in order to quantify its effect on the
learning process. The critical capacity is plotted as a function of $\gamma$ at a fixed robustness
$\epsilon = 0.3$ in the dense regime $f = 0.5$. The simulations show that there is a very sharp drop
in the maximum $\alpha$ when $\gamma$ goes below $\approx 2.4$.



Figure 6: Capacity as a function of the robustness parameter $\epsilon$ at sparseness $f = 0.2$.

The theoretical calculations are compared with the simulations for $f = 0.2$. Note that the
capacity in the sparse regime is higher than in the dense regime.


The 3TLR can also be adapted to work in a sparser regime, at a coding level lower than 0.5.
However, the average activity level of the network is determined by $H_0$, and their relationship
also involves the variance of the distribution of the synaptic weights when $f \neq 0.5$ (see
Materials and Methods). During the learning process, the variance of the weights changes, which
implies that the parameter $H_0$ must adapt correspondingly. In our simulations, this adaptation
was performed after each complete presentation of the whole pattern set. In practice, this
additional self-stabilizing mechanism could still be performed in an unsupervised fashion along
with (or in alternation with) the learning process. Using this adjustment, we simulated the network
at $f = 0.2$ and compared the results with the theoretical calculations. As shown in Fig 6, we can
achieve at least 70% of the critical capacity across different values of the robustness parameter
$\epsilon$.



Figure 7: Capacity as a function of correlations in the input patterns, for $f = 0.2$ at
$\epsilon = 3.0$. Patterns are organized in categories, with a correlation $c$ with the prototype
of the corresponding category (see text).

We also investigated numerically the effect of correlations in the input patterns. The PLR
is able to learn correlated patterns as long as a solution to the learning problem exists. As the
3TLR approximates the PLR, we expect the 3TLR to be able to learn correlated patterns as
well. As a simple model of correlation, we tested patterns organized in $L$ categories [26,27]. Each
category was defined by a randomly generated prototype. Prototypes were uncorrelated from
category to category. For each category, we then generated $p/L$ patterns independently with a
specified correlation coefficient $c$ with the corresponding prototype. We show in Fig 7 the results
of simulations with $L = 5$, $f = 0.2$ and $\epsilon = 3$. The figure shows that the learning rule
reaches a capacity that is essentially independent of $c$, in the range $0 < c < 0.75$.
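
The category structure described above can be sketched as follows. This is only an illustration: the text does not specify how patterns with a given correlation were sampled, so the conditional probabilities used here (which give a Pearson correlation $c$ with the prototype while keeping the coding level at $f$) are an assumption, and the function name `make_category_patterns` is hypothetical.

```python
import numpy as np

def make_category_patterns(n_units, n_categories, per_category, f, c, rng):
    """Binary (0/1) patterns grouped in categories, each correlated with its prototype."""
    # One random prototype per category, with coding level f
    prototypes = (rng.random((n_categories, n_units)) < f).astype(int)
    # Assumed construction: P(bit = 1 | prototype bit) chosen so that the mean
    # stays f and the Pearson correlation with the prototype equals c
    p_on = np.where(prototypes == 1, f + c * (1.0 - f), f * (1.0 - c))
    # Repeat each category's probabilities once per pattern of that category
    p_on = np.repeat(p_on, per_category, axis=0)
    patterns = (rng.random(p_on.shape) < p_on).astype(int)
    labels = np.repeat(np.arange(n_categories), per_category)
    return prototypes, patterns, labels

rng = np.random.default_rng(0)
prototypes, patterns, labels = make_category_patterns(
    n_units=1001, n_categories=5, per_category=40, f=0.2, c=0.5, rng=rng)

# Empirical check: average correlation of each pattern with its own prototype
mean_corr = np.mean([np.corrcoef(patterns[k], prototypes[labels[k]])[0, 1]
                     for k in range(len(patterns))])
print(patterns.mean(), mean_corr)
```

With $c = 0$ the patterns are independent of the prototypes, and with $c = 1$ they coincide with them, so the single parameter interpolates between uncorrelated and fully clustered inputs.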

Statistical properties of the connectivity matrix 

We next investigated the statistical properties of the connectivity matrix after the learning
process. Previous studies have shown that the distribution of synaptic weights in perceptrons with
excitatory synapses becomes, at maximal capacity, a delta function at zero weight plus a truncated
Gaussian for strictly positive weights [25,28,30]. Our model differs from this setting because of
the global inhibitory feedback. Despite this difference, the distribution of weights in our network
bears similarities with the results obtained in these previous studies: the distribution exhibits a
peak at zero weight (‘silent’, or ‘potential’ synapses), while the distribution of strictly positive
weights resembles a truncated Gaussian. Finally, the fraction of silent synapses increases with
the robustness parameter (see Fig 8).

Figure 8: Synaptic weight distributions. Comparison of the distributions of the synaptic weights
at critical capacity for three different values of robustness, obtained from simulation. The
distribution of weights approaches a Dirac delta at zero plus a truncated Gaussian. As
the patterns become more robust, the center of the partial Gaussian shifts towards the left, and
the number of silent synapses increases.


We have also computed the degree of symmetry of the weight matrix. The symmetry degree
is computed as the Pearson correlation coefficient between the reciprocal weights in pairs of
neurons. We observe a general trend towards an increasingly symmetric weight matrix as more
patterns are stored, for all values of the robustness parameter $\epsilon$ (see Fig 9).
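
A minimal sketch of this symmetry measure, assuming the weights are held in a square NumPy array `w` with `w[i, j]` the synapse from unit `j` to unit `i` (the function name `symmetry_degree` and the test matrices are illustrative, not taken from the simulation code):

```python
import numpy as np

def symmetry_degree(w):
    """Pearson correlation between w[i, j] and w[j, i] over all pairs i != j."""
    w = np.asarray(w, dtype=float)
    off_diag = ~np.eye(w.shape[0], dtype=bool)   # exclude self-connections
    # w[off_diag] and w.T[off_diag] list reciprocal weights in the same order
    return np.corrcoef(w[off_diag], w.T[off_diag])[0, 1]

rng = np.random.default_rng(1)
a = rng.random((200, 200))
sym = symmetry_degree(a + a.T)   # perfectly symmetric matrix
rnd = symmetry_degree(a)         # independent entries, no symmetry expected
print(sym, rnd)
```

A value near 1 indicates a symmetric matrix, while independent reciprocal weights give a value near 0.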



Figure 9: The degree of symmetry of the weight matrix. The Pearson correlation coefficient
between $w_{ij}$ and $w_{ji}$ is computed at different values of the memory load $\alpha$ for three
values of $\epsilon$. As $\alpha$ increases, the weight matrix tends to become more symmetric, but
the correlation saturates at high $\alpha$. For the same value of $\alpha$, the correlation also
increases with the robustness, so the weight matrix becomes more symmetric. Error bars (across 10
runs) are smaller than the symbols.

Discussion


We presented a biologically plausible learning rule that is characterized by three thresholds and
is able to store memory patterns close to the maximal storage capacity in recurrent neural
networks without the need for an explicit “error signal”. We demonstrated how the learning rule
can be considered a transformed version of the PLR in the limit of a strong external field. Our
network implements the separation between excitatory and inhibitory neurons, with learning
occurring only at excitatory-to-excitatory synapses. We simulated a recurrent network with
$N = 1001$ binary neurons, reaching $\alpha_c = 1.6$ at zero basin size. We then used a robustness
parameter $\epsilon$ to enlarge the basin size. The simulations showed that we are close to the
theoretical capacity across the whole investigated range of values of $\epsilon$. We expect that,
as $N$ increases and the learning rate gets smaller, the remaining difference would go to zero.

Two ingredients of the 3TLR are crucial: (1) strong external inputs, and (2) three
learning thresholds which are set according to the statistics of the inputs to the neuron. The
learning rule only uses information that is local to a synapse and the corresponding neurons. Like
classic Hebbian learning rules, our 3TLR works in an online fashion. In addition, it can also
perform as a ‘palimpsest’ [31-33]: if the total number of patterns exceeds the maximal capacity
(at a certain basin size), the network begins to forget patterns that are no longer being presented.

Comparison with other learning rules 

The 3TLR can be framed in the setting of the classic Bienenstock-Cooper-Munro (BCM) theory
[34,35], with additional requirements to adapt it to the attractor network scenario. The
original BCM theory uses firing-rate units, and prescribes that synaptic modifications should be
proportional to (1) the synaptic input, and (2) a function $\phi(v)$ of the total input $v$ (or,
equivalently, of the total output). The function $\phi(v)$ is subject to two conditions:
(1) $\phi(v) > 0$ (or $< 0$) when $v > \theta$ (or $< \theta$, respectively); (2) $\phi(0) = 0$.
The parameter $\theta$ is also assumed to change, but on a longer time scale (such that the changes
reflect the statistics of the inputs); this (metaplastic) adaptation has the goal of avoiding the
trivial situations in which all inputs elicit indistinguishable responses. This (loosely specified)
framework ensures that, under reasonable conditions, the resulting units become highly selective to
a subset of the inputs, and has been mainly used to model the developmental stages of primary
sensory cortex. The arising selectivity is spontaneous and completely unsupervised: in the absence
of further specifications, the units become selective to a random subset of the inputs (e.g.
depending on random initial conditions).

Our model is defined on simpler (binary) units; however, if we define
$\phi(v) = \Theta(v - \theta)\,\Theta(\theta_1 - v) - \Theta(\theta - v)\,\Theta(v - \theta_0)$,
then $\phi$ behaves according to the prescriptions of the BCM theory. Furthermore, we have
essentially assumed the same slow metaplastic adaptation mechanism of BCM, even though we have
assigned this role explicitly to the inhibitory part of the network (see Materials and Methods).
On the other hand, our model has additional requirements: (1) $\phi(v) = 0$ when $v < \theta_0$
or $v > \theta_1$; (2) plasticity occurs during presentation of external inputs, which in
turn are strong enough to drive the network towards a desired state. The second requirement
ensures that the network units become selective to a specific subset of the inputs, as opposed
to a random subset as in the original BCM theory, and thus that they are able to collectively
behave as an attractor network. The first requirement ensures that each unit operates close to
critical capacity. Indeed, these additional requirements involve extra parameters with respect to
the BCM theory, and we implicitly assume these parameters to also slowly adapt according to
the statistics of the inputs during network formation and development.
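
The three-threshold plasticity function discussed above can be written out directly. The sketch below is a plain transcription of $\phi(v)$ as a product of Heaviside functions, with $\theta_0 < \theta < \theta_1$; the numerical threshold values in the usage lines are hypothetical and chosen only for illustration.

```python
def heaviside(x):
    """Heaviside step function, taken here as 0 at x = 0."""
    return 1.0 if x > 0 else 0.0

def phi(v, theta0, theta, theta1):
    """Sign of the synaptic change as a function of the local field v.

    +1 (potentiation) for theta < v < theta1,
    -1 (depression)   for theta0 < v < theta,
     0 (no plasticity) below theta0 or above theta1.
    """
    potentiation = heaviside(v - theta) * heaviside(theta1 - v)
    depression = heaviside(theta - v) * heaviside(v - theta0)
    return potentiation - depression

# Hypothetical thresholds, for illustration only
theta0, theta, theta1 = 0.0, 1.0, 2.0
for v in (-1.0, 0.5, 1.5, 3.0):
    print(v, phi(v, theta0, theta, theta1))
```

Only the two outer regions, where $\phi$ vanishes, go beyond the original BCM prescription.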

A variant of the BCM theory, known as the ABS rule [36,37], introduced a lower threshold for
LTD, analogous to our $\theta_0$, motivated by experimental evidence; however, a high threshold for
LTP, analogous to our $\theta_1$, was not used there or, to our knowledge, in any other BCM variant.
The idea of stopping plasticity above some value of the ‘local field’ has been introduced previously
to stabilize the learning process in feed-forward networks with discrete synapses [38-40]. Our study
goes beyond these previous works in generalizing such a high threshold to recurrent networks,
and showing that the resulting networks achieve close to maximal capacity.

Comparison with data and experimental predictions 

In vitro experiments have characterized how synaptic plasticity depends on voltage [41] and
firing rate [42], both variables that are expected to have a monotonic relationship with the total
excitatory synaptic input received by a neuron. In both cases, a low value of the controlling
variable leads to no changes; intermediate values lead to depression; and high values lead to
potentiation. These three regimes are consistent with the three regions for $v < \theta_1$ in
Fig 2. The 3TLR predicts that a fourth region should occur at sufficiently high values of the
voltage and/or firing rate. Most of the studies investigating the dependence of plasticity on
firing rate or voltage have not reported a decrease in plasticity at high values of the controlling
variables, but these studies might not have increased such variables sufficiently. To our knowledge,
a single study has found that at high rates, the plasticity vs rate curve is a decreasing function
of the input rate [43].

Another test of the model consists in comparing the statistics of the synaptic connectivity with
experimental data. As has been argued in several recent studies [25,28,30,44,45], networks with
plastic excitatory synapses are generically sparse close to maximal capacity, with a connection
probability that decreases with the robustness of information storage, consistent with short-range
cortical connectivity [46]. Our network is no exception, though the fraction of silent synapses
that we observe is significantly lower than in models that lack inhibition. Furthermore, networks
that are close to maximal capacity tend to have a connectivity matrix with a significant
degree of symmetry, as illustrated by the over-representation of bidirectionally connected pairs
of neurons, and the tendency of bidirectionally connected pairs to form stronger synapses than
unidirectionally connected pairs, as observed in cortex [47,48], except in barrel cortex [49].
Again, the 3TLR we have proposed here reproduces this feature (Fig 9), consistent with the fact
that the rule approaches the optimal capacity.

Future directions 

Our network uses the simplest possible single neuron model [50]. One obvious direction for future
work would be to implement the learning rule in a network of more realistic neuron models, such as
firing rate models or spiking neuron models. Another potential direction would be to understand
the biophysical mechanisms leading to the high threshold in the 3TLR. In any case, we believe
the results discussed here provide a significant step in the quest for understanding how learning
rules in cortical networks can optimize information storage capacity.


Materials and Methods 

Simulation 

The main equations of the network, the neuron model, the learning rule, and the criteria for
stopping the learning algorithm are outlined in the Results section, Eqs. 1-7. We present here
additional details about network simulations.

Network setup before learning process


Before applying the learning rule, we required the network to have stable dynamics around a
desired activity level $f$. A network with only excitatory neurons is highly unstable and typically
converges towards the trivial all-off and all-on states; therefore, we implemented a global
inhibition such that the network operates around activity level $f$. The basal inhibitory term
($H_0$) and the inhibitory reaction term ($H_1$) are defined as:


$$H_0 = (N-1)\left(f\bar{w} - \psi\right) + H^{-1}(f)\,\sqrt{(N-1)f}\;\sigma_w \tag{11}$$

$$H_1 = f\gamma\sqrt{N} \tag{12}$$

where $H(x) = \frac{1}{2}\,\mathrm{erfc}\left(x/\sqrt{2}\right)$ and $H^{-1}$ is the inverse of $H$;
$\psi$ is defined by $\theta = (N-1)\psi$; $\bar{w}$ and $\sigma_w$ are the mean and standard
deviation of the synaptic weights, respectively. With these definitions the network dynamics is
stable, in the sense that the activity level converges to $f$ very fast, regardless of the initial
condition.

In Eq. 11 we see that $H_0$ depends on the activity level $f$ and on the standard deviation of
the weights $\sigma_w$. In the dense regime, $f = 0.5$, we have $H^{-1}(0.5) = 0$; therefore the
rightmost term of Eq. 11 vanishes, which means that in this regime $H_0$ is independent of
$\sigma_w$. However, in sparser regimes, the network must be endowed with a mechanism to adjust for
the changes in standard deviation; otherwise the learning process would bring the network out of
the stable state, changing the basal activity level. In contrast, the mean synaptic efficacy
$\bar{w}$ does not change significantly during the learning process.
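
The dependence of the basal inhibition on $f$ and $\sigma_w$ can be checked numerically. The sketch below assumes that Eq. 11 reads $H_0 = (N-1)(f\bar{w} - \psi) + H^{-1}(f)\sqrt{(N-1)f}\,\sigma_w$ and uses the parameter values of Table 1; the function names are illustrative, and the values of $\sigma_w$ are arbitrary.

```python
from math import erfc, sqrt
from statistics import NormalDist

def H(x):
    """Gaussian tail probability, H(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * erfc(x / sqrt(2.0))

def H_inv(p):
    """Inverse of H: since H(x) = 1 - Phi(x), H^{-1}(p) = Phi^{-1}(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p)

def basal_inhibition(n, f, w_mean, w_std, psi):
    """H_0 under the assumed form of Eq. 11 (see the lead-in above)."""
    return (n - 1) * (f * w_mean - psi) + H_inv(f) * sqrt((n - 1) * f) * w_std

# Dense regime: H^{-1}(0.5) = 0, so H_0 does not depend on the weight spread
dense_a = basal_inhibition(1001, 0.5, 1.08, 0.8, 0.35)
dense_b = basal_inhibition(1001, 0.5, 1.08, 1.2, 0.35)

# Sparse regime: H^{-1}(0.2) > 0, so H_0 grows with the weight spread
sparse_a = basal_inhibition(1001, 0.2, 1.08, 0.8, 0.35)
sparse_b = basal_inhibition(1001, 0.2, 1.08, 1.2, 0.35)
print(dense_a, dense_b, sparse_a, sparse_b)
```

This makes explicit why $H_0$ has to track $\sigma_w$ during learning only when $f \neq 0.5$.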

In all our simulations, the initial values for $\{w_{ij}\}$ were sampled from a Gaussian
distribution with mean and standard deviation equal to one, after which negative values were set to
zero. As a result, the mean of the initial weights is slightly higher than one. We also set
$w_{ii} = 0$ for all $i$.
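
The initialization just described is easy to reproduce. The following sketch samples the weights as stated and compares the empirical mean with the closed-form mean of a unit Gaussian centered at one and rectified at zero, $\Phi(1) + G(1) \approx 1.083$, which is in line with the value of about 1.08 quoted in Table 1. The seed and variable names are arbitrary.

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(0)
n = 1001

# Gaussian weights with mean 1 and standard deviation 1, rectified at zero
w = rng.normal(loc=1.0, scale=1.0, size=(n, n))
w[w < 0.0] = 0.0
np.fill_diagonal(w, 0.0)              # no self-connections

off_diag = ~np.eye(n, dtype=bool)
empirical_mean = w[off_diag].mean()

# E[max(0, X)] for X ~ N(1, 1) equals Phi(1) + G(1)
expected_mean = 0.5 * (1.0 + erf(1.0 / sqrt(2.0))) + exp(-0.5) / sqrt(2.0 * pi)
print(empirical_mean, expected_mean)
```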

Table 1 shows the values of the parameters used in the simulations, in the dense and sparse
regimes.


Table 1: Table of parameters in the simulation

Parameter name                 Value in dense regime              Value in sparse regime
$N$                            1001                               1001
$\lambda = \bar{w}$ (initial)  $\approx 1.08$                     $\approx 1.08$
$f$                            0.5                                0.2
$\psi$                         0.35                               0.35
$\theta$                       350                                350
$\eta$                         0.01 [0.001 when $\epsilon = 0$]   0.01 [0.001 when $\epsilon = 0$]
$\gamma$                       6.0                                12.0
# of iterations (learning)     1000 [10000 when $\epsilon = 0$]   1000 [10000 when $\epsilon = 0$]
# of trials in test phase      50                                 50


Direct comparison between the 3TLR and the PLR 

In order to determine the degree to which the 3TLR is able to mimic the PLR, and the effect
of deviations from the latter rule, we tested both rules on the same tasks. In these simulations,
every part of the simulation code was kept identical, including the pseudo-random numbers
used to choose the initial state and the arbitrary permutations for the update order of the units,
except for the learning rule. We tested the network in the dense case $f = 0.5$, at $\epsilon = 3$,
varying

Figure 10: Direct comparison of the 3TLR and the PLR. Success probability for the 3TLR
at $\gamma = 6$ (blue curve, left axis) and the PLR (red curve); the results for the 3TLR at
$\gamma = 12$ are identical to those of the PLR (red curve). The orange points show the absolute
difference between the final values of the weights for the 3TLR at $\gamma = 6$ and the PLR (right
axis): the points show the median of the distribution, while the error bars span the 5th-95th
percentiles, showing that, while the distribution is concentrated at near-zero values, outliers
appear at the critical capacity of the 3TLR algorithm. (Note that the average value of the weights
is in all cases approximately 1.08; also compare the discrepancies with the overall distribution of
the weights, Fig 8.)


the storage load $\alpha$, using 10 samples for each point. We compared the probability of solving
the learning task and the distribution of the discrepancies (absolute value of the differences) in
the values of the resulting synaptic weights. We tested two values of the parameter $\gamma$: 6 (as
in Fig 4) and 12. We found that at $\gamma = 12$ there was absolutely no difference between the two
rules, while at $\gamma = 6$ the 3TLR performed slightly worse, and significant deviations from the
PLR started to appear close to the maximal capacity of the 3TLR (see Fig 10).


Analytical calculation of the storage capacity at infinite N 

Entropy calculation 


In this section, we present the details of the calculations for the typical storage capacity of our
network in the limit $N \to \infty$, using the Gardner analysis [20].

The capacity is defined as the maximum value of $\alpha = p/N$ such that a solution to Eq. 7 can
typically be found.

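We can rewrite Eq. 7 as

$$\forall i:\quad \prod_{\mu=1}^{\alpha N} \Theta\left( \left(2\xi_i^\mu - 1\right) \left( \sum_{j=1}^{N} w_{ij}\,\xi_j^\mu - H_0 - \lambda\left( \sum_{j=1}^{N} \xi_j^\mu - fN \right) - \theta \right) - f\epsilon\sqrt{N} \right) = 1 \tag{13}$$

where

$$H_0 = N f\bar{w} - \theta + H^{-1}(f)\,\sigma_w\sqrt{fN} \tag{14}$$

$$\lambda = \bar{w} \tag{15}$$
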
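Eq. 13 becomes:

$$\forall i:\quad \prod_{\mu=1}^{\alpha N} \Theta\left( \left(2\xi_i^\mu - 1\right) \left( \sum_{j=1}^{N} \left(w_{ij} - \bar{w}\right)\xi_j^\mu - H^{-1}(f)\,\sigma_w\sqrt{fN} \right) - f\epsilon\sqrt{N} \right) = 1 \tag{16}$$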

Let us now consider a single unit $i$. We write $\sigma_i^\mu = (2\xi_i^\mu - 1)$, re-parametrize
the weights as $W_{ij} = \frac{w_{ij}}{\bar{w}} - 1 \in [-1, \infty)$, and also define

$$T = H^{-1}(f)\,\sqrt{f} \tag{17}$$

$$K = \frac{\epsilon}{\bar{w}} \tag{18}$$
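Dropping the index $i$ and neglecting terms of order 1, we obtain:

$$\prod_{\mu=1}^{\alpha N} \Theta\left( \sigma^\mu \left( \sum_{j=1}^{N} W_j\,\xi_j^\mu - T\sqrt{\sum_{j=1}^{N} W_j^2} \right) - fK\sqrt{N} \right) = 1 \tag{19}$$

Our goal is to compute the quenched entropy of this problem, i.e. the scaled average of the
logarithm of the volume $V$ of the weights $W$ that satisfy the above equation:

$$S = \frac{1}{N}\left\langle \log V \right\rangle = \frac{1}{N}\left\langle \log \int \prod_{j=1}^{N} dW_j\,\Theta\left(W_j + 1\right) \prod_{\mu=1}^{\alpha N} \Theta\left( \sigma^\mu \left( \sum_{j=1}^{N} W_j\,\xi_j^\mu - T\sqrt{\sum_{j=1}^{N} W_j^2} \right) - fK\sqrt{N} \right) \right\rangle \tag{20}$$
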

The computation proceeds along the lines of [20], by using the so-called replica trick to
perform the average of the logarithm of $V$, exploiting the identity:

$$\left\langle \log V \right\rangle = \lim_{n \to 0} \frac{\left\langle V^n \right\rangle - 1}{n} \tag{21}$$

performing the computation for integer values of $n$ and using an analytical continuation to
take the limit $n \to 0$. We perform the calculation using the replica-symmetric (RS) Ansatz,
which is believed to give exact results in the case of perceptron models with continuous weights.
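
The identity behind the replica trick follows from $V^n = e^{n \log V} \approx 1 + n \log V$ for small $n$, and can be illustrated numerically. In the sketch below the "volumes" are hypothetical log-normal samples, used only as a stand-in for a positive random quantity; they are not volumes of the actual weight space.

```python
import numpy as np

rng = np.random.default_rng(0)
volumes = np.exp(rng.normal(size=100_000))   # hypothetical positive samples of V
mean_log = np.log(volumes).mean()            # direct estimate of <log V>

def replica_estimate(n):
    """(<V^n> - 1) / n, which tends to <log V> as n -> 0."""
    return (np.mean(volumes ** n) - 1.0) / n

estimates = {n: replica_estimate(n) for n in (0.1, 0.01, 0.001)}
for n, value in estimates.items():
    print(n, value, abs(value - mean_log))
```

The discrepancy shrinks roughly linearly in $n$, as expected from the next term of the expansion, $n \langle \log^2 V \rangle / 2$.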
The final expression for the entropy depends on six order parameters; the first three are $Q$, $q$
and $M$, whose meaning is

$$Q = \frac{1}{N}\sum_{j=1}^{N} \left(W_j^a\right)^2, \qquad q = \frac{1}{N}\sum_{j=1}^{N} W_j^a W_j^b, \qquad M = \frac{1}{\sqrt{N}}\sum_{j=1}^{N} W_j^a$$

where we used $W^a$ and $W^b$ to denote two different replicas of the system, which can simply be
interpreted as two independent solutions to the constraint equation. $Q$ is called the self-overlap,
and is equal to $\left(\sigma_w/\bar{w}\right)^2$ in our case, while $q$ is the mutual overlap. The
remaining order parameters are the conjugate quantities $\hat{Q}$, $\hat{q}$ and $\hat{M}$. The
entropy expression is:


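$$S\left(Q, q, M, \hat{Q}, \hat{q}, \hat{M}\right) = -Q\hat{Q} + \frac{q\hat{q}}{2} + \alpha\,\mathcal{Z}_A(Q, q, M) + \mathcal{Z}_W\left(\hat{Q}, \hat{q}, \hat{M}\right) \tag{22}$$

where

$$\mathcal{Z}_A(Q, q, M) = \int Du\, \left\langle \ln H\left( \frac{K - \sigma\left(M - T\sqrt{Q}\right) + u\,(1-f)\sqrt{q}}{(1-f)\sqrt{Q - q}} \right) \right\rangle_\sigma \tag{23}$$

$$\mathcal{Z}_W\left(\hat{Q}, \hat{q}, \hat{M}\right) = \int Du\, \ln \int_{-1}^{\infty} dW\, e^{-\frac{1}{2}\left(\hat{q} - 2\hat{Q}\right)W^2 + W\left(u\sqrt{\hat{q}} - \hat{M}\right)} \tag{24}$$

We used the usual notation $Du = du\,\frac{e^{-u^2/2}}{\sqrt{2\pi}} = du\,G(u)$ to denote Gaussian
integrals, and defined $H(x) = \int_x^\infty Du = \frac{1}{2}\,\mathrm{erfc}\left(x/\sqrt{2}\right)$.
In the following, we will also use the shorthand $\mathcal{G}(x) = G(x)/H(x)$.
We also used the notation $\langle\cdot\rangle_\sigma$ to denote the average over the output
$\sigma$, i.e. $\langle\varphi(\sigma)\rangle_\sigma = f\varphi(1) + (1-f)\,\varphi(-1)$ for any
function $\varphi$. The value of the order parameters is found by extremizing $S$.
The notation and the following computations can be simplified using: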


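$$\Delta Q = Q - q \tag{25}$$

$$t_\sigma(u) = \frac{K - \sigma\left(M - T\sqrt{Q}\right) + u\,(1-f)\sqrt{q}}{(1-f)\sqrt{\Delta Q}} \tag{26}$$

$$\Delta\hat{Q} = \hat{q} - 2\hat{Q} \tag{27}$$

$$\nu(u, W) = e^{-\frac{1}{2}\Delta\hat{Q}\,W^2 + W\left(u\sqrt{\hat{q}} - \hat{M}\right)} \tag{28}$$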


The extremization of S then results in the system of equations: 


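$$\Delta\hat{Q} = \frac{\alpha}{\sqrt{\Delta Q\,(Q - \Delta Q)}} \int Du\, u\, \left\langle \mathcal{G}\left(t_\sigma(u)\right) \right\rangle_\sigma \tag{29}$$

$$\hat{q} = \frac{\alpha}{\Delta Q} \int Du\, \left\langle \mathcal{G}\left(t_\sigma(u)\right)\, t_\sigma(u) \right\rangle_\sigma + \Delta\hat{Q} \tag{30}$$

$$0 = \int Du\, \left\langle \mathcal{G}\left(t_\sigma(u)\right)\, \sigma \right\rangle_\sigma \tag{31}$$

$$Q = \int Du\, \frac{\int_{-1}^{\infty} dW\, W^2\, \nu(u, W)}{\int_{-1}^{\infty} dW\, \nu(u, W)} \tag{32}$$

$$\Delta Q = \frac{1}{\sqrt{\hat{q}}} \int Du\, u\, \frac{\int_{-1}^{\infty} dW\, W\, \nu(u, W)}{\int_{-1}^{\infty} dW\, \nu(u, W)} \tag{33}$$

$$0 = \int Du\, \frac{\int_{-1}^{\infty} dW\, W\, \nu(u, W)}{\int_{-1}^{\infty} dW\, \nu(u, W)} \tag{34}$$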
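
The integrals over $dW$ in the last three equations can be performed explicitly, yielding:

$$Q = \frac{\hat{q} + \hat{M}^2 + \Delta\hat{Q}}{\Delta\hat{Q}^2} + \frac{1}{\Delta\hat{Q}^{3/2}} \int Du\, \left(u\sqrt{\hat{q}} - \hat{M} - \Delta\hat{Q}\right) \mathcal{G}\left(-\frac{u\sqrt{\hat{q}} - \hat{M} + \Delta\hat{Q}}{\sqrt{\Delta\hat{Q}}}\right) \tag{35}$$

$$\Delta Q = \frac{1}{\Delta\hat{Q}} + \frac{1}{\sqrt{\Delta\hat{Q}\,\hat{q}}} \int Du\, u\, \mathcal{G}\left(-\frac{u\sqrt{\hat{q}} - \hat{M} + \Delta\hat{Q}}{\sqrt{\Delta\hat{Q}}}\right) \tag{36}$$

$$0 = -\frac{\hat{M}}{\Delta\hat{Q}} + \frac{1}{\sqrt{\Delta\hat{Q}}} \int Du\, \mathcal{G}\left(-\frac{u\sqrt{\hat{q}} - \hat{M} + \Delta\hat{Q}}{\sqrt{\Delta\hat{Q}}}\right) \tag{37}$$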
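
The explicit integration over $W$ rests on the first two moments of a Gaussian truncated at $W = -1$. As a hedged numerical check, the sketch below compares closed-form moments of the weight $\exp(-\frac{1}{2}aW^2 + bW)$ on $[-1, \infty)$ with a direct quadrature. Here `a` stands for $\Delta\hat{Q}$ and `b` for $u\sqrt{\hat{q}} - \hat{M}$; the closed forms are written in terms of $\mathcal{G}(x) = G(x)/H(x)$ and are our own derivation, and the numerical values of `a` and `b` are arbitrary.

```python
import numpy as np
from math import erfc, exp, pi, sqrt

def H(x):
    return 0.5 * erfc(x / sqrt(2.0))

def G(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def closed_form_moments(a, b):
    """<W> and <W^2> under the weight exp(-a W^2 / 2 + b W) on [-1, inf)."""
    x = (b + a) / sqrt(a)
    g = G(x) / H(-x)                        # this is G(-x) / H(-x), i.e. script-G(-x)
    m1 = b / a + g / sqrt(a)
    m2 = b * b / a ** 2 + 1.0 / a + g * (b - a) / a ** 1.5
    return m1, m2

def numeric_moments(a, b, n=200_001):
    """Same moments by trapezoidal quadrature on a uniform grid."""
    upper = b / a + 12.0 / sqrt(a)          # far in the Gaussian tail
    w = np.linspace(-1.0, upper, n)
    weights = np.ones(n)
    weights[0] = weights[-1] = 0.5          # trapezoid end-point weights
    nu = np.exp(-0.5 * a * w ** 2 + b * w) * weights
    z = nu.sum()
    return (w * nu).sum() / z, (w * w * nu).sum() / z

exact = closed_form_moments(2.0, 0.5)
approx = numeric_moments(2.0, 0.5)
print(exact, approx)
```

Averaging these moments over $u$ with the Gaussian measure $Du$ gives the right-hand sides of the three equations for $Q$, $\Delta Q$ and the vanishing mean of $W$.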


Critical capacity 


At critical capacity, the space of the solutions shrinks to a point, and the mutual overlap tends
to become equal to the self-overlap: $q \to Q$, i.e. $\Delta Q \to 0$. In this limit, the conjugate
order parameters diverge as:

$$\hat{q} = \frac{C}{\Delta Q^2} \tag{38}$$

$$\Delta\hat{Q} = \frac{A}{\Delta Q} \tag{39}$$

$$\hat{M} = \frac{B\sqrt{C}}{\Delta Q} \tag{40}$$


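Using these conditions, and calling $\alpha_c$ the critical value of $\alpha$, the saddle point
equations, Eqs. 29 to 33, become:

$$Q = \frac{1}{A}\left(C - B\sqrt{C}\right) \tag{41}$$

$$A = H\left(B - \frac{A}{\sqrt{C}}\right) \tag{42}$$

$$0 = \frac{\sqrt{C}}{A}\left( G\left(B - \frac{A}{\sqrt{C}}\right) - A B \right) - (1 - A) \tag{43}$$

$$C = \alpha_c\, Q\, \left\langle \left(1 + \tau_\sigma^2\right) H(\tau_\sigma) - \tau_\sigma\, G(\tau_\sigma) \right\rangle_\sigma \tag{44}$$

$$A = \alpha_c \left\langle H(\tau_\sigma) \right\rangle_\sigma \tag{45}$$

$$0 = \left\langle \sigma \left( G(\tau_\sigma) - \tau_\sigma H(\tau_\sigma) \right) \right\rangle_\sigma \tag{46}$$

where we defined

$$\tau_\sigma = \frac{\sigma\left(M - T\sqrt{Q}\right) - K}{(1-f)\sqrt{Q}} \tag{47}$$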


These equations can be solved numerically to find the six parameters $\alpha_c$, $Q$, $A$, $B$, $C$
and $M$.

Note that in the special case $K = 0$ these equations have a degenerate solution with $Q = 0$
and the same $\alpha_c$ as in the case of unbounded synaptic weights (e.g. $\alpha_c = 2$ for
$f = 0.5$). This is because in that case the original problem has the property that scaling all
weights by a factor of $x$ is equivalent to scaling the boundary $\bar{w}$ by a factor of $x^{-1}$
(see Eq. 16): therefore, the optimal strategy is to exploit this property by setting $x \to 0$, i.e.
effectively reducing the problem to the unbounded case. Of course, this strategy can only be pursued
up to the available precision in a practical setting.
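
As a point of comparison for the unbounded case, the classical Gardner capacity for unbiased patterns and unconstrained weights, $\alpha_c(\kappa) = \left[\int_{-\kappa}^{\infty} Dt\,(t + \kappa)^2\right]^{-1} = \left[(1 + \kappa^2)\,H(-\kappa) + \kappa\,G(\kappa)\right]^{-1}$, can be evaluated in a few lines. This is a sketch of the textbook formula only; $\kappa$ is the standard stability margin and no attempt is made here to map it onto the robustness parameter $\epsilon$ of our model.

```python
from math import erfc, exp, pi, sqrt

def H(x):
    return 0.5 * erfc(x / sqrt(2.0))

def G(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def gardner_capacity(kappa):
    """Critical load of an unconstrained perceptron with unbiased patterns."""
    return 1.0 / ((1.0 + kappa ** 2) * H(-kappa) + kappa * G(kappa))

for kappa in (0.0, 0.5, 1.0, 2.0):
    print(kappa, gardner_capacity(kappa))
```

At zero margin the formula gives $\alpha_c = 2$, the Gardner bound quoted above for $f = 0.5$, and the capacity decreases monotonically as the margin grows.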















Acknowledgments 

References 

1. Hopfield JJ. Neural networks and physical systems with emergent collective computational 
abilities. Proc Natl Acad Sci USA. 1982;79:2554-2558. 

2. Amit DJ. Modeling brain function. Cambridge University Press; 1989. 

3. Hertz J, Krogh A, Palmer RG. Introduction to the Theory of Neural Computation. 
Addison-Wesley, Redwood City; 1991. 

4. Amit DJ, Brunel N. Model of global spontaneous activity and local structured activity
during delay periods in the cerebral cortex. Cerebral Cortex. 1997;7:237-252.

5. Brunel N, Wang XJ. Effects of neuromodulation in a cortical network model of object
working memory dominated by recurrent inhibition. J Comput Neurosci. 2001;11:63-85.

6. Mongillo G, Barak O, Tsodyks M. Synaptic Theory of Working Memory. Science.
2008;319:1543. 

7. Barak O, Tsodyks M. Working models of working memory. Curr Opin Neurobiol. 
2014;25:20-24. 

8. Fuster JM, Jervey JP. Inferotemporal neurons distinguish and retain behaviourally relevant
features of visual stimuli. Science. 1981;212:952-955. 

9. Miyashita Y. Neuronal correlate of visual associative long-term memory in the primate 
temporal cortex. Nature. 1988;335:817-820. 

10. Miyashita Y, Chang HS. Neuronal correlate of pictorial short-term memory in the primate 
temporal cortex. Nature. 1988;331:68-70. 

11. Nakamura K, Kubota K. Mnemonic firing of neurons in the monkey temporal pole during 
a visual recognition memory task. J Neurophysiol. 1995;74:162-178. 

12. Fuster JM, Alexander G. Neuron activity related to short-term memory. Science. 
1971;173:652-654. 

13. Funahashi S, Bruce CJ, Goldman-Rakic PS. Mnemonic coding of visual space in the 
monkey’s dorsolateral prefrontal cortex. J Neurophysiol. 1989;61:331-349. 

14. Romo R, Brody CD, Hernandez A, Lemus L. Neuronal correlates of parametric working 
memory in the prefrontal cortex. Nature. 1999;399:470-474. 

15. Amit DJ, Gutfreund H, Sompolinsky H. Storing infinite numbers of patterns in a spin-glass 
model of neural networks. Phys Rev Lett. 1985;55:1530-1531. 

16. Sompolinsky H. Neural networks with nonlinear synapses and a static noise. Phys Rev A. 
1986;34:2571-2574. 

17. Derrida B, Gardner E, Zippelius A. An exactly solvable asymmetric neural network model. 
Europhys Lett. 1987;4:167-173. 


18. Tsodyks M, Feigel’man MV. The enhanced storage capacity in neural networks with low
activity level. Europhys Lett. 1988;6:101-105.

19. Buhmann J, Divko R, Schulten K. Associative memory with high information content. 
Phys Rev A. 1989;39:2689-2692. 

20. Gardner EJ. The phase space of interactions in neural network models. J Phys A: Math 
Gen. 1988;21:257-270. 

21. Rosenblatt F. Principles of neurodynamics. Spartan Books, New York; 1962. 

22. Marr D. A theory of cerebellar cortex. J Physiol. 1969;202:437-470. 

23. Albus JS. A theory of cerebellar function. Mathematical Biosciences. 1971;10:26-51. 

24. Ito M, Sakurai M, Tongroach P. Climbing fibre induced depression of both mossy fibre
responsiveness and glutamate sensitivity of cerebellar Purkinje cells. J Physiol.
1982;324:113-134.

25. Clopath C, Nadal JP, Brunel N. Storage of correlated patterns in standard and bistable
Purkinje cell models. PLoS Comput Biol. 2012;8:e1002448.

26. Parga N, Virasoro MA. The ultrametric organization of memories in a neural network. 
J Phys France. 1986;47:1857-1864. 

27. Brunel N, Carusi F, Fusi S. Slow stochastic Hebbian learning of classes in recurrent neural
networks. Network. 1998;9:123-152.

28. Brunel N, Hakim V, Isope P, Nadal JP, Barbour B. Optimal information storage and the
distribution of synaptic weights: perceptron versus Purkinje cell. Neuron. 2004;43:745-57.

29. Brunel N, van Rossum MC. Lapicque’s 1907 paper: from frogs to integrate-and-fire. Biol
Cybern. 2007;97:337-339.

30. Clopath C, Brunel N. Optimal properties of analog perceptrons with excitatory weights.
PLoS Comput Biol. 2013;9:e1002919.

31. Mézard M, Nadal JP, Toulouse G. Solvable models of working memories. J Physique.
1986;47:1457-. 

32. Parisi G. A memory which forgets. J Phys A: Math Gen. 1986;19:L617. 

33. Amit DJ, Fusi S. Dynamic learning in neural networks with material synapses. Neural
Computation. 1994;6:957-982. 

34. Bienenstock E, Cooper L, Munro P. Theory for the development of neuron selectivity: 
orientation specificity and binocular interaction in visual cortex. J Neurosci. 1982;2:32-48. 

35. Jedlicka P. Synaptic plasticity, metaplasticity and BCM theory. Bratislavske lekarske listy.
2002;103(4/5):137-143.

36. Bröcher S, Artola A, Singer W. Intracellular injection of Ca2+ chelators blocks induction
of long-term depression in rat visual cortex. Proc Natl Acad Sci USA. 1992;89(1):123-127.

37. Artola A, Singer W. Long-term depression of excitatory synaptic transmission and its
relationship to long-term potentiation. Trends Neurosci. 1993;16(11):480-487.

38. Amit Y, Mascaro M. Attractor networks for shape recognition. Neural Comput.
2001;13:1415-1442. 

39. Fusi S, Drew PJ, Abbott LF. Cascade models of synaptically stored memories. Neuron. 
2005;45:599-611. 

40. Brader JM, Senn W, Fusi S. Learning real-world stimuli in a neural network with spike- 
driven synaptic dynamics. Neural Comput. 2007;19:2881-2912. 

41. Ngezahayo A, Schachner M, Artola A. Synaptic activity modulates the induction of
bidirectional synaptic changes in adult mouse hippocampus. J Neurosci. 2000;20:2451-2458.

42. Kirkwood A, Rioult MC, Bear MF. Experience-dependent modification of synaptic
plasticity in visual cortex. Nature. 1996;381:526-528.

43. Wang H, Wagner JJ. Priming-induced shift in synaptic plasticity in the rat hippocampus. 
J Neurophysiol. 1999;82:2024-2028. 

44. Barbour B, Brunel N, Hakim V, Nadal JP. What can we learn from synaptic weight
distributions? Trends Neurosci. 2007;30:622-629.

45. Chapeton J, Fares T, LaSota D, Stepanyants A. Efficient associative memory storage
in cortical circuits of inhibitory and excitatory neurons. Proc Natl Acad Sci USA.
2012;109:E3614-3622.

46. Kalisman N, Silberberg G, Markram H. The neocortical microcircuit as a tabula rasa. 
Proc Natl Acad Sci U S A. 2005;102:880-885. 

47. Song S, Sjostrom PJ, Reigl M, Nelson S, Chklovskii DB. Highly nonrandom features of 
synaptic connectivity in local cortical circuits. PLoS Biol. 2005;3:e68. 

48. Wang Y, Markram H, Goodman PH, Berger TK, Ma J, Goldman-Rakic PS. Heterogeneity 
in the pyramidal network of the medial prefrontal cortex. Nat Neurosci. 2006;9:534-542. 

49. Lefort S, Tomm C, Floyd Sarria JC, Petersen CC. The excitatory neuronal network of the
C2 barrel column in mouse primary somatosensory cortex. Neuron. 2009;61:301-316. 

50. McCulloch WS, Pitts WA. A logical calculus of the ideas immanent in nervous activity. 
Bull Math Biophys. 1943;5:115-133. 





